[SPARK-32877][SQL] Fix Hive UDF not support decimal type in complex type #29749
ulysses-you wants to merge 5 commits into apache:master
Conversation
Generally, a UDF has just one evaluate method.
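For context, a hedged sketch of such a UDF (hypothetical class, not taken from this PR): a Hive simple UDF exposes a single evaluate method, and Spark reflects on that method's signature to resolve argument types.

```scala
import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical Hive simple UDF: the single evaluate method's parameter type
// (java.util.List[java.lang.Double]) is what Spark reflects against.
class UDFArraySum extends UDF {
  def evaluate(values: java.util.List[java.lang.Double]): java.lang.Double = {
    var sum = 0.0
    val it = values.iterator()
    while (it.hasNext) sum += it.next()
    sum
  }
}
```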
Test build #128642 has finished for PR 29749 at commit
Test build #128674 has finished for PR 29749 at commit
Test build #128696 has finished for PR 29749 at commit
Why do we need special handling for decimal types?
It's a compatibility issue. Normally, the data type is converted by the Hive ObjectInspector at run time. But Hive does not support a decimal input type when the method requires a double type. Unfortunately, the default type of the literal 1.1 differs between Spark and Hive (decimal and double, respectively), which causes this issue.
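A quick way to see the difference (a sketch assuming a running SparkSession named spark; typeof is a built-in Spark SQL function):

```scala
// Spark parses the literal 1.1 as a decimal, while Hive parses it as a double,
// so array(1.1) is array<decimal(2,1)> in Spark but array<double> in Hive.
spark.sql("SELECT typeof(1.1)").collect()
// => Array([decimal(2,1)])
```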
Ah, I see. Could you describe it in the PR description?
Sorry, I didn't get it. ImplicitCastInputTypes can implicitly cast decimal to double, can't it?
In the first commit I did it for all types, but the tests did not pass. The reason is that a UDF required an Object type.
We convert the Java Object type to NullType in HiveInspectors.javaTypeToDataType (seems this can be changed?):
// Hive seems to return this for struct types?
case c: Class[_] if c == classOf[java.lang.Object] => NullType
So we can't reflect the UDF method using a NullType; the error message is:
in query: cannot resolve 'example_format('%o', 93)' due to data type mismatch: argument 2 requires array<null> type, however, '93' is of int type.; line 1 pos 7;
If a Hive UDF requires Object, I think it means AnyDataType. We should special-case only that.
How about this?
val expectTypes = method.getGenericParameterTypes.map(javaTypeToDataType)
if (expectTypes.exists(_.existsRecursively(_.isInstanceOf[NullType]))) {
  children.map(_.dataType)
} else {
  expectTypes
}
method.getGenericParameterTypes.map(javaTypeToDataType).map { dt =>
  if (dt.existsRecursively(_.isInstanceOf[NullType])) AnyDataType else dt
}
Seems we should check the data type and replace NullType with AnyDataType one by one, to avoid cases such as a Map<Double, Object> input.
I will do another check to see whether we can change HiveInspectors.javaTypeToDataType to use AnyDataType directly.
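Putting the two suggestions together, a rough sketch of the per-parameter handling (my sketch, not the PR's final code; it assumes the method, javaTypeToDataType, and children values quoted above, plus Spark's internal AbstractDataType/AnyDataType):

```scala
// Sketch only: map each reflected parameter type to an expected input type.
// A parameter that is exactly java.lang.Object (reflected as NullType) becomes
// AnyDataType. A type that merely contains Object, e.g. Map<Double, Object>,
// falls back to the child's actual type, since AnyDataType is an
// AbstractDataType and cannot be nested inside a MapType or ArrayType.
override lazy val inputTypes: Seq[AbstractDataType] = {
  val expected = method.getGenericParameterTypes.map(javaTypeToDataType)
  expected.zip(children).map {
    case (NullType, _) => AnyDataType
    case (dt, child) if dt.existsRecursively(_.isInstanceOf[NullType]) => child.dataType
    case (dt, _) => dt
  }.toSeq
}
```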
Also cc @cloud-fan @dongjoon-hyun, this is a similar issue to #13930.
  with Logging
- with UserDefinedExpression {
+ with UserDefinedExpression
+ with ImplicitCastInputTypes {
ScalaUDF doesn't extend ImplicitCastInputTypes either. Does it have the same problem?
No, ImplicitTypeCasts already handles ScalaUDF via case udf: ScalaUDF if udf.inputTypes.nonEmpty =>
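For reference, that rule looks roughly like this (paraphrased, not a verbatim quote of TypeCoercion.ImplicitTypeCasts):

```scala
// Approximate shape of the ScalaUDF handling inside ImplicitTypeCasts:
case udf: ScalaUDF if udf.inputTypes.nonEmpty =>
  val newChildren = udf.children.zip(udf.inputTypes).map { case (in, expected) =>
    // Try an implicit cast to the type the Scala function expects;
    // keep the original child if no cast applies.
    implicitCast(in, expected).getOrElse(in)
  }
  udf.copy(children = newChildren)
```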
I'm confused. The expected type should be defined by the function signature, not the actual function inputs. What are we doing here?
As an example, ScalaUDF.inputTypes is derived from the function signature (captured by encoders).
ScalaUDF also implicitly converts the input type to the expected type in ImplicitTypeCasts.
Let's say we have a UDF Array[Double] => Double = { data => data.sum } and we run spark.sql("select udf(array(1.0, 1.1, 1.2))"). Then ImplicitTypeCasts will cast array<decimal> to array<double>.
But a Hive UDF can't enjoy that; currently we only use the Hive ObjectInspector to convert data types at run time. That's the difference.
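That example, spelled out (assuming a local SparkSession named spark):

```scala
// ScalaUDF path: ImplicitTypeCasts rewrites the argument from array<decimal>
// to array<double>, because the function signature asks for Seq[Double].
spark.udf.register("arr_sum", (data: Seq[Double]) => data.sum)
spark.sql("select arr_sum(array(1.0, 1.1, 1.2))").show()  // ~3.3
```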
Ah, I misread the code. We get the children's data types only to skip decimal.
Test build #128879 has finished for PR 29749 at commit
My last concern is to check
Seems there is something wrong,
After checking the Hive code, it's a Hive version issue. In Hive 2.3.x this problem will not happen since HIVE-13380 was merged; what I said is based on Hive 1.2.x. Another thing is that we cannot implicitly cast the input type, because we need to use the children's data types to reflect the UDF method first and we get an exception if that fails, so we never get the chance to go in. Sorry for the mistake; it seems this issue does not need to be fixed.
Thanks for the investigation! You can probably open a new PR to add comments for
thanks @cloud-fan @maropu |
Hi, @ulysses-you and @cloud-fan and @maropu?
test("SPARK-32877: Fix Hive UDF not support decimal type in complex type") {
If this is not covered by any other test case, it looks worth having.
I'm wondering if this test still fails with -Phive-1.2.
@dongjoon-hyun I will open a new PR that only adds a test and some comments.
What changes were proposed in this pull request?
This PR aims to support Hive UDFs when the input type is not the expected one. We make HiveSimpleUDF extend ImplicitCastInputTypes, then use the Analyzer to implicitly cast the input data types.
Why are the changes needed?
It's a compatibility issue. Normally, the data type is converted by the Hive ObjectInspector at run time. But Hive does not support a decimal input type when the method requires a double type. Unfortunately, the default type of the literal 1.1 differs between Spark and Hive (decimal and double, respectively), which causes this issue.
Before this PR, we failed on code of the following shape.
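The original snippet is not reproduced above; as an illustration only (hypothetical function name and class), the failure had this shape:

```scala
// Hypothetical reproduction: a Hive UDF whose evaluate takes List<Double>,
// called with Spark's default decimal literals.
sql("CREATE TEMPORARY FUNCTION array_sum AS 'some.pkg.UDFArraySum'")
sql("SELECT array_sum(array(1.1, 2.2))")
// Before this PR this fails: Hive cannot convert the decimal array elements
// to the double elements the method requires.
```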
Does this PR introduce any user-facing change?
Yes, users now get the correct result.
How was this patch tested?
Added a test.